library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 7 Solutions
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc) The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2 or 3 dimensional graphs.
- Explain why you choose the specific graph type
- If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualizaton.
(be sure to only include the category tags for the data you use!)
I am going to work with my data set for my final project, which is a data set from AirBnB.
# read in data:
setwd("C:/Users/caitr/OneDrive/Documents/DACSS/601_Fall_2022/posts")
Error in setwd("C:/Users/caitr/OneDrive/Documents/DACSS/601_Fall_2022/posts"): cannot change working directory
<- read_csv("Boston AirBnB Data.csv") Boston
Error: 'Boston AirBnB Data.csv' does not exist in current working directory ('C:/Users/srika/OneDrive/Desktop/601_Fall_2022/posts').
I am using data from the Inside AirBnB website capturing data related to summary information and metrics for listings in Boston, MA.
Tidy Data (as needed):
# look for duplicates
# look for missing values
# look for outliers - review how to clean
# remember na.rm=TRUE for calculations
# clean the data:
# At first glance, it seems as though there are no values in the column titled "neighbourhood_group." So, I will find all unique values within that column to determine whether it can be removed from my tidy data set.
unique(Boston[c("neighbourhood_group")])
Error in unique(Boston[c("neighbourhood_group")]): object 'Boston' not found
# I now know that there is no data within this column. I will remove it from my data set.
<- subset(Boston, select = -c(neighbourhood_group)) Boston_tidy
Error in subset(Boston, select = -c(neighbourhood_group)): object 'Boston' not found
# I can see from viewing this data frame that there are no other columns that are absent any values, so I will move on to other tidying tasks.
# rename columns:
names(Boston_tidy) <- c('room_id', 'room_name', 'host_id', 'host_name', 'neighborhood', 'room_latitude', 'room_longitude', 'room_type', 'room_price', 'min_nights', 'number_reviews', 'last_review', 'reviews_per_month', 'host_listings', 'availability_next_365', 'number_reviews_LTM', 'room_license')
Error in names(Boston_tidy) <- c("room_id", "room_name", "host_id", "host_name", : object 'Boston_tidy' not found
# find duplicates:
<- duplicated(Boston_tidy) duplicates
Error in duplicated(Boston_tidy): object 'Boston_tidy' not found
# reached "max.print", so I will increase the limit and identify if any values within the vector = TRUE:
options(max.print=999999)
"TRUE"] duplicates[
Error in eval(expr, envir, enclos): object 'duplicates' not found
# "TRUE" = NA, so I now know that there are no duplicates in my data set.
“Boston_tidy” represents AirBnB rental listing data for the city of Boston over the last twelve months. The data frame has 17 variables and 5,185 rows of data. Each row now represents one unique observation—or in this case, a unique rental listing—that includes data related to the following variables: (1) room/listing ID number, (2) name of the room/listing, (3) listing host ID number, (4) listing host name, (5) room/listing neighborhood, (6) room/listing latitude, (7) room/listing longitude, (8) type of room/listing, (9) room/listing price, (10) minimum number of nights for rent, (11) number of room/listing reviews, (12) most recent room/listing review, (13) number of room/listing reviews per month, (14) number host-specific listings, (15) room/listing availability over the next year, (16) number of reviews for room/listing over the past 12 months, and (17) room/listing licensure status.
I will now generate summary statistics:
# summary statistics for numeric variables:
%>%
Boston_tidy select(room_price, min_nights, number_reviews, host_listings, availability_next_365) %>%
summary()
Error in select(., room_price, min_nights, number_reviews, host_listings, : object 'Boston_tidy' not found
Mutate
First, I will mutate the “latitude” and “longitude” variables to create a “coordinates” variable.
# mutate lat and lon to create "room_coordinates"
# keep lat and lon columns for now
<- Boston_tidy %>%
Boston_mutate mutate("room_coordinates" = paste(room_latitude, room_longitude))
Error in mutate(., room_coordinates = paste(room_latitude, room_longitude)): object 'Boston_tidy' not found
data.frame()
data frame with 0 columns and 0 rows
Next, I will generate a new data set that contains a variable for median room prices:
# find median room prices by neighborhood and room type:
<- Boston_mutate%>%
Boston_median filter(room_price>0) %>%
group_by(room_type, neighborhood)%>%
summarize(median_price = median(room_price))
Error in filter(., room_price > 0): object 'Boston_mutate' not found
Visualization with Multiple Dimensions
I will next generate visualizations with multiple dimensions.
# generate stacked bar chart indicating percent out of 100
library(RColorBrewer)
library(ggplot2)
library(ggtext)
Error in library(ggtext): there is no package called 'ggtext'
ggplot(Boston_tidy, aes(x=neighborhood, y=room_price, fill=neighborhood)) + geom_col(stat="identity") +
scale_fill_hue() +
theme_classic() +
theme(axis.text.x = element_markdown(angle=90, hjust=1))
Error in ggplot(Boston_tidy, aes(x = neighborhood, y = room_price, fill = neighborhood)): object 'Boston_tidy' not found
This is interesting, but not a great way to visualize this data, as we don't get to see the distribution of data points. I will next generate a geom_point plot that omits outliers and contains a boxplot overlay.
# remove outliers:
<- function(x) {
is_outlier return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
<- Boston_mutate %>%
Boston_trim filter(!is_outlier(room_price))
Error in filter(., !is_outlier(room_price)): object 'Boston_mutate' not found
# create dataframe:
%>%
Boston_trimfilter(room_price>0, room_price<600) %>%
group_by(room_type, neighborhood)
Error in filter(., room_price > 0, room_price < 600): object 'Boston_trim' not found
#generate geom_point chart
%>%
Boston_trimgroup_by(room_type, neighborhood)%>%
ggplot(aes(x=neighborhood, y=room_price)) +
geom_point(alpha=.08, size=5, color = "light pink")+
labs(x="Neighborhood",y="Price Per Night", title = "Boston Airbnb Rental Prices by Neighborhood")+
theme_classic() +
geom_boxplot()
Error in group_by(., room_type, neighborhood): object 'Boston_trim' not found
theme(axis.text.x = element_markdown(angle=90, hjust=1))
Error in element_markdown(angle = 90, hjust = 1): could not find function "element_markdown"
I will next generate a violin plot that also omits outliers:
# generate violin plot:
%>%
Boston_trimggplot(aes(x=room_type, y=room_price, fill=neighborhood)) +
geom_violin(width = 1, size = .5,
scale = "area",
trim = TRUE)+
coord_flip()+
labs(x="Room Type", y="Room Price", title = "Price by Room Type and Neighborhood")
Error in ggplot(., aes(x = room_type, y = room_price, fill = neighborhood)): object 'Boston_trim' not found
Given the number of neighborhoods, this is extremely difficult to read. I will use facet wrapping to adjust the visualization.
# violin plot with facet wrapping:
%>%
Boston_trim ggplot(aes(x=neighborhood, y=room_price, fill=room_type)) +
geom_violin(width = 1, size = .5,
scale = "area",
trim = FALSE)+
facet_wrap(vars(neighborhood), scales = "free")+
theme_bw()+
labs(x="Neighborhood", y = "Price", fill= "Room Type", title = "Boston Airbnb Prices by Neighborhood and Room Type", subtitle= "2021")
Error in ggplot(., aes(x = neighborhood, y = room_price, fill = room_type)): object 'Boston_trim' not found
I want to figure out how to adjust the scale of the violin plots themselves so that I can read the data. I have tried adjusting the scale in the facet wrapping code chunk, but that hasn't worked so far. I will continue working on this.
I will next generate a tree map. It won't be helpful in terms of visualizing prices, but I wanted to give it a shot as practice.
library(treemap)
library(RColorBrewer)
%>%
Boston_tidytreemap(room_type,
index= c("neighborhood", "room_type"),
vSize = "room_price",
type = "index",
title = "Median Listing Price by Neighborhood, Airbnb NYC 2019",
overlap.labels = 0) +
scale_color_brewer(palette = "PiYG")
Error in treemap(., room_type, index = c("neighborhood", "room_type"), : object 'Boston_tidy' not found
# I'm not sure why the Brewer palette isn't applying to the tree map.